Acquiring a Poor Man's Inflectional Lexicon for German
Abstract
Many NLP modules and applications, such as morphosyntactic corpus annotation tools, require a module for wide-coverage inflectional analysis. One way to provide such analyses is to look up the word form in an inflectional lexicon. For better maintainability, such a lexicon should list stems and their inflectional classes rather than full forms. To my knowledge, no such lexicon is freely available for German. Furthermore, existing inflectional lexicons need to be expanded, for instance, to encompass domain-specific vocabulary. Creating and maintaining an inflectional lexicon by hand is a dull and strenuous task. Since large text corpora are now readily available and inflectional systems are in general well understood, it seems feasible to acquire lexical data from raw texts, guided by our knowledge of inflection. Several such methods have been developed in recent years for different languages including Croatian, Russian, French, and Slovak (see references). I present an acquisition method along these lines for German. The general idea can be roughly summarised as follows: first, generate a set of lexical entry hypotheses for each word form in the corpus; then, select the hypotheses that "best" explain the word forms found in the corpus. To this end, I have turned an existing morphological grammar, cast in finite-state technology (Schmid et al., 2004), into a hypothesiser for lexical entries. Irregular forms are simply listed so that they do not interfere with the regular rules used in the hypothesiser. Running the hypothesiser on a text corpus yields a large number of lexical entry hypotheses. These are then ranked according to their validity with the help of a statistical model that is based on the number of attested and predicted word forms for each hypothesis. First results of the system are promising; e.g., > 50% precision and > 75% recall are achieved for verbs.
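The two-step idea in the abstract (hypothesise lexical entries, then rank them by attested versus predicted forms) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two toy inflection classes in `CLASSES`, the function names `hypothesise` and `rank`, and the scoring by a plain attested-to-predicted ratio. The actual system uses a finite-state grammar and a statistical model whose details are not given in the abstract.

```python
# Hypothetical toy inflection classes: class name -> suffixes of the predicted forms.
# Illustrative assumptions only, not the classes of the grammar used in the paper.
CLASSES = {
    "verb_weak": ["en", "e", "st", "t", "te", "test", "ten", "tet"],
    "noun_en_plural": ["", "en"],
}


def hypothesise(word_forms):
    """Return every (stem, class) pair that could have produced some corpus form."""
    hypotheses = set()
    for form in word_forms:
        for cls, suffixes in CLASSES.items():
            for suffix in suffixes:
                # Require a non-empty stem once the suffix is stripped.
                if form.endswith(suffix) and len(form) > len(suffix):
                    stem = form[: len(form) - len(suffix)]
                    hypotheses.add((stem, cls))
    return hypotheses


def rank(word_forms):
    """Score each hypothesis by the share of its predicted forms found in the corpus."""
    corpus = set(word_forms)
    scored = []
    for stem, cls in hypothesise(corpus):
        predicted = {stem + suffix for suffix in CLASSES[cls]}
        attested = predicted & corpus
        # Stand-in score (an assumption): fraction of predicted forms that are attested.
        scored.append((len(attested) / len(predicted), stem, cls))
    # Best score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[0], item[1], item[2]))


corpus = ["sagen", "sage", "sagst", "sagt", "sagte", "Frau", "Frauen"]
ranked = rank(corpus)
for score, stem, cls in ranked[:3]:
    print(f"{score:.3f}  {stem}  {cls}")
```

A plain ratio favours classes with few forms, so a realistic model would also have to account for paradigm size and word-form frequency; the sketch only shows how competing hypotheses for the same word form can be compared.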
Similar resources
A Language-independent Approach to Extracting Derivational Relations from an Inflectional Lexicon
In this paper, we describe and evaluate an unsupervised method for acquiring pairs of lexical entries belonging to the same morphological family, i.e., derivationally related words, starting from a purely inflectional lexicon. Our approach relies on transformation rules that relate lexical entries with one another, and which are automatically extracted from the inflected lexicon based on su...
Automatic Lexical Acquisition for German Based on Morphological Paradigms: Diploma Thesis Proposal
The general aim of my diploma thesis is to develop a (semi-)automatic method for the acquisition of a German inflectional lexicon from raw texts. In particular, I want to explore whether inflectional stems can be deduced from word-form occurrences that fit into known morphological paradigm classes.
DeLex, a freely-avaible, large-scale and linguistically grounded morphological lexicon for German
We introduce DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German Wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the development of DeLex involved some manual wor...
Resource-Light Acquisition of Inflectional Paradigms
This paper presents a resource-light acquisition of morphological paradigms and lexicon for fusional languages. It builds upon Paramor [10], an unsupervised system, by extending it: (1) to accept a small seed of manually provided word inflections with marked morpheme boundary; (2) to handle basic allomorphic changes acquiring the rules from the seed and/or from previously acquired paradigms. Th...
Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora
We present a method for automatically learning inflectional classes and associated lemmas from morphologically annotated corpora. The method consists of a core language-independent algorithm, which can be optimized for specific languages. The method is demonstrated on Egyptian Arabic and German, two morphologically rich languages. Our best method for Egyptian Arabic provides an error reduction o...